LLM 25-Day Course - Day 21: Preparing Fine-Tuning Datasets

The success of fine-tuning depends on data quality: no matter how strong the model and training techniques are, poor data produces poor results. Today we cover the two most widely used data formats and the steps for preparing a dataset.

Alpaca Format (Instruction Tuning)

The Alpaca format consists of three fields: instruction, input, and output. It is suited to single-turn instruction-following and Q&A tasks.

import json

# Alpaca format data examples
alpaca_data = [
    {
        "instruction": "Translate the given text to French.",
        "input": "The weather is beautiful today.",
        "output": "Le temps est magnifique aujourd'hui."
    },
    {
        "instruction": "Find and fix the bug in the following code.",
        "input": "def add(a, b):\n    return a - b",
        "output": "The operator is incorrect. `a - b` should be changed to `a + b`.\n\n```python\ndef add(a, b):\n    return a + b\n```"
    },
    {
        "instruction": "Explain Python's list comprehension.",
        "input": "",  # Empty string when no input is needed
        "output": "List comprehension is a concise syntax for creating a new list based on an existing list. It uses the form `[expression for element in iterable if condition]`."
    },
]

# Save as JSON file
with open("train_alpaca.json", "w", encoding="utf-8") as f:
    json.dump(alpaca_data, f, ensure_ascii=False, indent=2)

print(f"Saved: {len(alpaca_data)} samples")
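At training time, the three fields are usually joined into a single prompt string. As a minimal sketch, the helper below renders a sample with the prompt template popularized by the original Stanford Alpaca project; the function name render_alpaca is illustrative, and your training framework may apply its own template instead.

```python
# Sketch: render an Alpaca sample into one training prompt string.
# Samples without an "input" use a shorter template variant.

def render_alpaca(sample: dict) -> str:
    if sample.get("input", "").strip():
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{sample['instruction']}\n\n"
            f"### Input:\n{sample['input']}\n\n"
            f"### Response:\n{sample['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{sample['instruction']}\n\n"
        f"### Response:\n{sample['output']}"
    )

sample = {
    "instruction": "Translate the given text to French.",
    "input": "The weather is beautiful today.",
    "output": "Le temps est magnifique aujourd'hui.",
}
print(render_alpaca(sample))
```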

ShareGPT Format (Multi-turn Conversation)

The ShareGPT format records multi-turn conversations. It is suitable for chatbot fine-tuning.

# ShareGPT format data examples
sharegpt_data = [
    {
        "conversations": [
            {"from": "system", "value": "You are a Python expert. Explain with code."},
            {"from": "human", "value": "What is the difference between a dictionary and a list?"},
            {"from": "gpt", "value": "A dictionary stores data as key-value pairs, while a list is an ordered collection of values.\n\n```python\n# List: access by index\nfruits = ['apple', 'banana', 'cherry']\nprint(fruits[0])  # apple\n\n# Dictionary: access by key\nprices = {'apple': 1.00, 'banana': 0.50}\nprint(prices['apple'])  # 1.00\n```"},
            {"from": "human", "value": "Can you use a list as a dictionary key?"},
            {"from": "gpt", "value": "No, lists are mutable so they cannot be used as dictionary keys. Keys must be hashable immutable objects. You can use tuples instead."},
        ]
    },
]

with open("train_sharegpt.json", "w", encoding="utf-8") as f:
    json.dump(sharegpt_data, f, ensure_ascii=False, indent=2)
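Many chat-oriented training pipelines expect role/content messages rather than ShareGPT's from/value turns. The conversion is a simple mapping; as a sketch, the role names below (system, user, assistant) follow the OpenAI-style convention, and ROLE_MAP is an illustrative name, so adjust it to whatever your framework expects.

```python
# Sketch: convert ShareGPT "from"/"value" turns into "role"/"content"
# messages. Unknown speaker names raise an error rather than being
# silently dropped, so format problems surface early.

ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(record: dict) -> list[dict]:
    messages = []
    for turn in record["conversations"]:
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            raise ValueError(f"Unknown speaker: {turn['from']!r}")
        messages.append({"role": role, "content": turn["value"]})
    return messages

example = {
    "conversations": [
        {"from": "system", "value": "You are a Python expert."},
        {"from": "human", "value": "What is a tuple?"},
        {"from": "gpt", "value": "An immutable sequence type."},
    ]
}
print(sharegpt_to_messages(example))
```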

Data Cleaning and Validation

Collected data must go through a cleaning process. Deduplication, quality filtering, and format validation are key.

from datasets import Dataset
import hashlib

def validate_and_clean(data):
    """Data quality validation and cleaning"""
    cleaned = []
    seen_hashes = set()
    issues = {"duplicate": 0, "too_short": 0, "empty_output": 0}

    for item in data:
        # Remove empty outputs
        if not item.get("output", "").strip():
            issues["empty_output"] += 1
            continue

        # Remove outputs that are too short (less than 10 characters)
        if len(item["output"].strip()) < 10:
            issues["too_short"] += 1
            continue

        # Deduplication (hash-based)
        content_hash = hashlib.md5(
            (item["instruction"] + item["output"]).encode()
        ).hexdigest()
        if content_hash in seen_hashes:
            issues["duplicate"] += 1
            continue
        seen_hashes.add(content_hash)

        cleaned.append(item)

    print(f"Original: {len(data)} -> After cleaning: {len(cleaned)}")
    print(f"Removal reasons: {issues}")
    return cleaned

# Run cleaning
cleaned_data = validate_and_clean(alpaca_data)

# Convert to Hugging Face datasets
dataset = Dataset.from_list(cleaned_data)
print(dataset)

# Train/validation split
split_dataset = dataset.train_test_split(test_size=0.1, seed=42)
print(f"Train: {len(split_dataset['train'])}, Validation: {len(split_dataset['test'])}")
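The cleaning function above handles deduplication and quality filtering; the format-validation part can be sketched as a schema check on field names and types. The function validate_schema and the constant REQUIRED_FIELDS below are illustrative names, and the fields are those of the Alpaca format.

```python
# Sketch: structural validation for Alpaca-format records.
# Collects per-record error messages instead of failing on the
# first problem, so a whole file can be audited in one pass.

REQUIRED_FIELDS = ("instruction", "input", "output")

def validate_schema(data: list[dict]) -> list[str]:
    errors = []
    for i, item in enumerate(data):
        for field in REQUIRED_FIELDS:
            if field not in item:
                errors.append(f"sample {i}: missing field '{field}'")
            elif not isinstance(item[field], str):
                errors.append(f"sample {i}: field '{field}' is not a string")
    return errors

good = [{"instruction": "a", "input": "", "output": "b"}]
bad = [{"instruction": "a", "output": 42}]
print(validate_schema(good))  # []
print(validate_schema(bad))
```

Running the check before conversion to a Dataset keeps malformed records from failing later, deep inside a training run.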

Loading Hub Data with the datasets Library

from datasets import load_dataset

# Load public datasets from Hugging Face Hub
dataset = load_dataset("tatsu-lab/alpaca", split="train")
print(f"Alpaca dataset: {len(dataset)} samples")
print(f"Columns: {dataset.column_names}")
print(f"\nFirst sample:\n{dataset[0]}")

# Korean dataset example
ko_dataset = load_dataset("heegyu/ko-chatgpt-qa", split="train")
print(f"\nKorean Q&A: {len(ko_dataset)} samples")

Today’s Exercises

  1. Write 20 or more instruction-output pairs in Alpaca format from your area of expertise, and validate them with the validate_and_clean() function.
  2. Write 5 conversation datasets in ShareGPT format with 3+ turns each, load them with the datasets library, and split into train/validation sets.
  3. Find 3 or more Korean instruction datasets on Hugging Face Hub and compare their size, format, license, and quality.
